The following article is Open access

MAGIC: Microlensing Analysis Guided by Intelligent Computation

and

Published 2022 October 14 © 2022. The Author(s). Published by the American Astronomical Society.
, , Citation Haimeng Zhao and Wei Zhu 2022 AJ 164 192 DOI 10.3847/1538-3881/ac9230

Download Article PDF
DownloadArticle ePub

You need an eReader or compatible software to experience the benefits of the ePub3 file format.

1538-3881/164/5/192

Abstract

The modeling of binary microlensing light curves via the standard sampling-based method can be challenging, because of the time-consuming light-curve computation and the pathological likelihood landscape in the high-dimensional parameter space. In this work, we present MAGIC, which is a machine-learning framework to efficiently and accurately infer the microlensing parameters of binary events with realistic data quality. In MAGIC, binary microlensing parameters are divided into two groups and inferred separately with different neural networks. The key feature of MAGIC is the introduction of a neural controlled differential equation, which provides the capability to handle light curves with irregular sampling and large data gaps. Based on simulated light curves, we show that MAGIC can achieve fractional uncertainties of a few percent on the binary mass ratio and separation. We also test MAGIC on a real microlensing event. MAGIC is able to locate degenerate solutions even when large data gaps are introduced. As irregular samplings are common in astronomical surveys, our method also has implications for other studies that involve time series.

Export citation and abstract BibTeX RIS

Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

When a distant star (called the source) gets sufficiently aligned with a massive foreground object (called the lens), the gravitational field of the lens focuses the light out of the distant star, thus making the distant star appear brighter (Einstein 1936; Paczynski 1986). For a typical source star inside the Milky Way, one can observe the time evolution of their brightness (i.e., light curves) and infer the existence and properties of companion objects to the lens by monitoring the deviations in the light curve from the single-lens scenario (e.g., Mao & Paczynski 1991; Gould & Loeb 1992). This so-called gravitational microlensing technique has been frequently used to detect exoplanets and stellar binaries, and is complementary to other techniques (see reviews by Gaudi 2012 and Zhu & Dong 2021).

For a microlensing event with multiple lenses, the interpretation of the light curve can be challenging. First of all, the computation of the multiple-lens microlensing light curve can be time consuming due to the finite-source effect (e.g., Dong et al. 2006; Bozza 2010). This is especially true when the microlens system consists of three or more objects (e.g., Gaudi et al. 2008; Kuang et al. 2021). Additionally, the likelihood landscape of the high-dimensional parameter space can be so pathological that traditional sampling-based methods may have a hard time searching for the correct solution (or solutions). This remains to be true even when the brute force search on a fine grid that is defined by a subset of model parameters is conducted. As a result, the current analysis of multiple-lens microlensing events is still case by case, with each event requiring hundreds of (or more) CPU hours as well as (usually) the supervision of domain experts (e.g., Dominik 1999; Gould & Han 2000; An 2005; Song et al. 2014; Zang et al. 2022; Zhang et al. 2022).

Several methods have been proposed to automate the parameter estimation problem in microlensing events with two lenses (Vermaak 2003; Khakpash et al. 2019; Kennedy et al. 2021; Zhang et al. 2021). These binary-lens events are the overwhelming majority of all anomalous events in current microlensing surveys, as events with more lenses are expected to be rare (Zhu et al. 2014). Of the proposed methods, machine learning is a promising approach that may efficiently and accurately infer parameters out of binary microlensing light curves (Vermaak 2003; Zhang et al. 2021), in addition to its application to the identification (or classification) of microlensing events (Wyrzykowski et al 2015; Godines et al. 2019; Mróz 2020). However, the existing machine-learning methods for parameter inferences are not readily applicable to microlensing data from ongoing surveys. In particular, Zhang et al. (2021) trained and evaluated their deep learning network on simulated light curves from the Roman microlensing survey (Penny et al. 2019). While their model can successfully predict the posterior distributions of microlensing model parameters and identify degenerate solutions (Zhang et al. 2022), their simulated light curves contain ∼104 time stamps with equal steps and high photometric precision. Ongoing microlensing surveys from the ground, such as the Korea Microlensing Telescope Network (KMTNet; Kim et al. 2016), see light curves with lower signal-to-noise ratios (S/Ns) and irregular samplings (including data gaps due to seasonal, diurnal, and instrumental issues). These realistic features cannot be handled in the framework of Zhang et al. (2021).

Irregular time series are common in astronomical observations from the ground (and sometimes from space as well). These irregular time steps pose challenges to the family of recurrent neural networks (RNNs; Goodfellow et al. 2016), which is widely considered the standard machine-learning method to deal with time series data. Variants of RNNs have been invented to handle irregular time series (Che et al. 2018). For example, Charnock & Moss (2017) imputed/interpolated on the originally unevenly sampled light curves and obtained evenly sampled ones, whereas Naul et al. (2018) added the time steps between consecutive observations as additional information to the standard RNN. Neither of these approaches would work well in the present situation. As the source enters or exits the caustic curve of the binary lens, microlensing light curves may change behaviors within brief time intervals. If these intervals fall into the data gap, it will be difficult for the standard interpolation methods to make up the missing features without introducing false signals (Rubanova et al. 2019).

In this paper, we present Microlensing Analysis Guided by Intelligent Computation, or MAGIC for short. MAGIC is a machine-learning framework that can efficiently and accurately perform parameter inference in binary microlensing events with realistic sampling conditions. A schematic overview of the proposed framework is shown in Figure 1. The binary microlensing parameters are explained in Section 2. We divide these parameters into two groups and develop different machine-learning methods in Sections 3 and 4. Section 5 describes the joint pipeline and its application to a real microlensing event. Finally, we discuss our results in Section 6.

Figure 1.

Figure 1. The schematic view of MAGIC. The binary microlensing parameters are divided into two groups. We use locator, which is a U-Net architecture, to infer the time-dependent parameters, namely the peak time t0 and the event timescale tE. The output is used to transform the light curve into a function of τ ≡ (tt0)/tE, and estimator, which combines neural controlled differential equation (neural CDE) and mixture density network (MDN), is used to infer the other microlensing parameters. We refer to Sections 3, 4, and 5 for a detailed description of our method.

Standard image High-resolution image

2. Binary Microlensing Parameters

We describe the parameters that are required to model the binary microlensing events. These parameters are grouped and inferred separately.

The standard binary microlensing light curve can be defined by the following parameters (see Gaudi 2012 and references therein)

Equation (1)

Here t0 is the time of the closest approach between the source and lens under the rectilinear trajectory, u0 is the separation between the source and the lens at t0, ρ is the size of the source star, and tE is the event timescale

Equation (2)

where μrel is the relative proper motion between the lens and the source. The Einstein ring radius θE is given by

Equation (3)

where κ ≡ 4G/c2au ≈ 8.14 mas/M, ${\pi }_{\mathrm{rel}}\equiv \mathrm{au}({D}_{{\rm{L}}}^{-1}-{D}_{{\rm{S}}}^{-1})$, with ML and DL the mass of and distance to the lens object/system, respectively, and DS the distance to the source. Both u0 and ρ are normalized to the Einstein radius θE.

Three of the remaining parameters, (q, s, α), characterize the binary component. Here q is the mass ratio between the secondary and the primary components, s is the projected separation between the two binary components normalized to θE, and α measures the angle from the binary axis to the direction of the lens–source relative motion. We follow the convention of Skowron et al. (2011) in the definition of the microlensing parameters except for α, for which our definition is offset by 180°. 3

The baseline magnitude of the microlensing light curve is given by mbase, and the fraction of the total baseline flux coming from the microlensing source is given by fS. That is,

Equation (4)

where A(ux , uy ) is the light curve in microlensing magnifications. The quantity, 1 − fS, represents the blending fraction, namely the fraction of total flux coming from stars/objects other than the source. The parameters ux and uy measure the displacement from the center of mass of the binary (i.e., the chosen origin) to the source at time t. In the chosen reference frame,

Equation (5)

In principle, since microlensing observations are usually taken by telescopes at different sites or in different filters, each event may have multiple sets of mbase and fS parameters. In practice, as long as the different data sets cover a broad range of microlensing magnifications, they can all be aligned to a single data set to relatively high precision without reference to specific microlensing models (e.g., Yoo et al. 2004; Gould et al. 2010a). We therefore only include one set of mbase and fS in the present work.

The binary microlensing parameters, given by Equation (1), are treated/inferred differently in the present work. First of all, it is reasonable to assume that the baseline magnitude mbase can be immediately read out of the ground-based light curve, which usually monitors (or can monitor) the source star for a time much longer than the event timescale. We therefore do not include mbase in our method. The scaled source size, ρ, is constrained only when the source is very close to or crosses the caustic curve of the binary lens. For the chosen (effective) observational cadence of this work, the finite-source effect is not expected to be widely detected. We therefore fix ρ = 10−3 and do not infer it. As will be shown later, our inference of the other parameters is not affected by whether or not the finite-source effect is present.

Of the remaining parameters, t0 and tE define the position and extent of the light curve in time, whereas the other parameters, (u0, q, s, α, fS), concern primarily the shape of the binary microlensing light curve. We show in Sections 3 and 4 that these two groups of parameters can be inferred separately. It is also worth noting that, once the other parameters are chosen, the source flux parameter, fS, can be derived analytically via the maximum likelihood method (Gould 1995). In Section 4, we make use of this property in the further optimization of the microlensing parameters.

Throughout this work, the VBBinaryLensing algorithm (VBBL; Bozza 2010; Bozza et al. 2018) inside the MulensModel (Poleski & Yee 2019) package is used to generate binary microlensing light curves. We describe the generation of the training data set in the two different networks separately.

3. Inference of Time-dependent Parameters

3.1. Method

Together with u0, parameters t0 and tE are the so-called Paczynski parameters (Paczynski 1986). Unlike the parameters that define the binary properties, these Paczynski parameters are constrained by the global properties of the light curve. In many low-q events, especially planetary events, one can accurately constrain the Paczynski parameters by masking out the relatively brief anomalous features (e.g., Gould & Loeb 1992). Of all three Paczynski parameters, the time-dependent ones, namely t0 and tE, enter the microlensing phenomenon via the parameter τ (see Equation (5)). For these reasons, we develop a stand-alone machine-learning method to infer the time-dependent parameters, which is dubbed locator.

The basis of locator is the U-Net architecture. U-Net is a fully convolutional neural network that is originally designed to perform semantic segmentation of images (Ronneberger et al. 2015). U-Net takes the original image as the input, and outputs a mask of the same shape. The mask can be interpreted as a pixel-wise classification result, representing, for example, the probability of each pixel belonging to some specific segment. U-Net has been ubiquitously used in biomedical image processing (Siddique et al. 2021) since it was proposed in 2015. It has also seen a growing number of applications in astronomy (e.g., Van Oort et al. 2019; Makinen et al. 2021).

The architecture of our locator closely follows that of (Ronneberger et al. 2015). See their Figure 1 for an illustration. In detail, locator differs in three aspects. First, all convolutional layers are one-dimensional, because locator deals with time series data rather than image data. Second, the pure convolutional blocks are replaced by residual blocks (He et al. 2016) with Parametric Rectified Linear Unit (PReLU, He et al. 2015) activation functions to ease gradient back propagation. Third, we set the channel and kernel sizes to 128 and 7, respectively, to better match the shape of the input data. In practice, the performance of locator is not sensitive to the values of these hyperparameters.

We design locator to identify the light-curve segment that falls within a predefined time window ΔT ≡ [t0ktE, t0 + ktE]. Here k is a parameter that can be tuned to optimize the output performance, and its choice will be discussed in Section 3.3. For each light curve, which is irregularly sampled (see Section 3.2), we first interpolate it with natural cubic splines and extract the values at nt = 4000 4 equally stepped time stamps ${\{{t}_{j}\}}_{j=1}^{{n}_{t}}$. 5 These time stamps are assigned labels depending on whether or not they are inside ΔT,

Equation (6)

At run time, locator takes the time series and attempts to produce a mask ${\{{p}_{j}\}}_{j=1}^{{n}_{t}}$ that best approximates the ground truth labels.

We train locator on simulated light curves that match the key properties of real microlensing light curves (see Section 3.2). The loss function that locator minimizes is given by

Equation (7)

Here ${{ \mathcal L }}_{\mathrm{BCE}}$ is the binary cross entropy (BCE) loss function

Equation (8)

which is the standard loss function for binary classification (Goodfellow et al. 2016). The expectation runs over all time stamps of all light curves in the training set. The second term in Equation (7) is the Dice loss function (Sudre et al. 2017), which is introduced to assuage the class imbalance problem in events with small tE:

Equation (9)

The third term in Equation (7) is the regularization term to prevent the mask from wiggling up and down more than twice

Equation (10)

The expectations in the last two terms are performed on all light curves in the training set.

The output mask from locator contains information about the event peak time, t0, and the timescale, tE. Specifically, the centroid of the mask corresponds to t0 and the total enclosed area is directly related to tE. In the continuum form, they are given by

Equation (11)

These integrals are evaluated via the trapezoidal rule in the discrete time series.

3.2. Data Generation

To simulate light curves for the training of locator, we randomly sample 2000 time stamps out of a total duration of 270 days. This corresponds to an averaged cadence of ∼3 hr, with the lower and upper 10th percentiles corresponding to 20 minutes and 7.2 hr. For each light curve, we randomly draw tE from a truncated log-normal distribution with a mean of 101.15 days, a standard deviation of 100.45 days, and boundaries at 5 and 100 days. This log-normal distribution matches approximately the observed distribution of tE (e.g., Mróz et al. 2017), and the truncations are introduced to eliminate relatively short-timescale and long-timescale events, both of which may be subject to additional issues (e.g., parallax effect for long events) and are not the focus of the present work. As for t0, even though it can be any value during (and possibly outside) the sampled time interval, it is reasonable to believe that a very rough by-eye estimate can limit t0 to within ∼±tE of the truth. We therefore randomly draw t0 out of a uniform distribution between − tE and tE. To summarize,

Although the goal of locator is to infer t0 and tE and not the other parameters, the training data set should be representative of the target—the binary light curves. We therefore sample the remaining parameters in the following way

and ρ is fixed at 10−3, given that the finite-source effect is not expected to be detected for the adopted (effective) cadence. Note that the mass ratio q is limited to > 10−3 simply because we want to focus on relatively massive binaries. With increased sampling cadence locator can in principle be extended down to lower q values. We defer a detailed exploration to future works.

With the sampled binary parameters and time stamps, we calculate the microlensing light curve using the VBBL algorithm (Bozza 2010; Bozza et al. 2018). To introduce noises to the simulated light curves, we adopt a simplified noise model. Gaussian noises with S/Ns of 33 in flux are assumed for all data points, and the flux values are converted to magnitudes for a baseline object with mbase = 18.

Simulated light curves without significant binary features, defined as ${\chi }_{\mathrm{pspl}}^{2}$/d.o.f. <1.25, 6 are rejected. In the end, two data sets each containing 105 light curves are generated, one for training and the other for validation and test.

3.3. Performance of Locator

For our default run, we set k = 1/3 so that ΔT with the longest timescale (100 days) can still be fully covered in the simulation interval [−135, 135]. We train locator with the ADAM optimizer (Kingma & Ba 2014). The learning rate is set to 4 × 10−3 initially and drops by ten percent after each epoch. Training samples are divided into minibatches of size 128. After 13 epochs (∼2 hr) of training on one NVIDIA Tesla V100 GPU, the average loss drops to 0.16 on the validation set of size 1024. The optimized model is then used to predict the mask and estimate t0 and tE according to Equation (11).

We evaluate locator on 16,384 light curves from the test set. The output masks are illustrated in Figures 2 and 3, and the comparisons between the predicted and ground truth values of lgtE and t0/tE are shown in Figure 4. The accuracy of the prediction is measured by the rms error (RMSE) between the ground truth and the prediction. They are indicated in the upper left corner respectively. We note that locator has very different performances on close (s < 1) binaries and wide (s > 1) binaries. This is especially true for the t0/tE parameter. We take the results of close binaries as a more reliable evaluation of the overall performance and defer to the end of this section a more detailed explanation of this phenomenon.

Figure 2.

Figure 2. An illustration of the output mask of locator. The two-dimensional histogram shows the distribution of p values inside the chosen τ window for 16,384 light curves with s < 1. The red dashed lines indicate τ = ±k = ±1/3, which are the boundaries of the input mask in this realization.

Standard image High-resolution image
Figure 3.

Figure 3. Similar to Figure 2, but for events with s > 1.

Standard image High-resolution image
Figure 4.

Figure 4. Comparisons between input and predicted values of tE (left) and t0/tE (right). Values from events with s > 1 and s < 1 are separated, with the former shown in red and the latter in blue. The rms errors (RMSEs) of the parameters are computed and indicated at the upper left corners. In this realization, k = 1/3 is used.

Standard image High-resolution image

In the default run, locator achieves an accuracy in lgtE with RMSE = 0.049, which translates to a fractional uncertainty of 11.3% on tE. The accuracy of the ratio t0/tE is comparable at RMSE (s < 1) = 0.062, which corresponds to an uncertainty of 1.9 days on t0 for a typical event with tE = 30 days. An example event is shown in Figure 5, in which the light curve substantially deviates from the Paczynski (1986) model due to the caustic-crossing feature and yet locator successfully recovers t0 and tE at high accuracy.

Figure 5.

Figure 5. An example event in which locator successfully reproduces the input window and thus recovers t0 and tE. The upper panel shows the microlensing geometry, with the red asterisk marking the center of mass (c.o.mass) of the binary lens. The lower panel shows the simulated light curve and the input and output masks. The parameters used to reproduce this event are given in the upper-right corner.

Standard image High-resolution image

As shown in the left panel of Figure 4, the scattering in lgtE increases for shorter events. This is probably because, as the timescale decreases, fewer data points are available to locator in the parameter inference. For example, with k = 1/3 there are only ∼25 time stamps that fall within ΔT for the shortest (tE = 5 days) event. To improve the performance of locator on short events, we select light curves with tE < 40 days and train another model with k = 2. After eight epochs of training with the same learning rate schedule, the loss drops to 0.083 on the validation set. We evaluate the new model on 16,384 light curves with tE < 40 days. 7 The results are shown in Figure 6. The prediction accuracy on the timescale increases by a factor of ∼2, reaching a fractional uncertainty on tE of 6.2%. The performance on the ratio t0/tE remains largely unchanged.

Figure 6.

Figure 6. Similar to Figure 4 except that only events with tE < 40 days and k = 2 are used.

Standard image High-resolution image

The above investigation has two implications. First, it outlines a general strategy to obtain a more accurate tE. For example, one may divide the overall range of tE into multiple smaller intervals and train separate locator models. The preliminary estimate of tE from the default run of locator can then be "polished" by the locator model that is optimized for the narrower tE interval including the preliminary estimate. Additionally, the comparison in locator performance between two k values also sheds light on how locator works. In the single-lens case, the most constraining power on t0 comes from the central region with a width of ∼u0 tE, whereas the event timescale tE is best constrained by data points extending into the light-curve wings. This seems to apply to locator as well. As shown in Figure 7, the timescale parameter can be better constrained when a wider mask window (i.e., a larger k) is used, whereas the event peak time t0 is largely unaffected by the choice of k. More works are needed to better understand the mechanism of locator (and U-Net in general).

Figure 7.

Figure 7. The performance of locator, evaluated on RMSEs of lgtE, t0/tE, and t0, as a function of k values.

Standard image High-resolution image

We have fixed the magnitude of the baseline object at 18 and used S/N = 33 for all data points in all simulated events. In reality, the baseline magnitude varies from event to event, and the photometric uncertainty varies from data point to data point. However, as discussed previously, the choice of the baseline magnitude does not really matter to the performance of locator, as long as the photometric observations cover a long enough time baseline. We therefore only focus on the impact of varying S/N on the performance of locator. Rather than simulating the realistic noise conditions, 8 we study the impact of different training strategies. Several data sets with different S/N values are generated following the method in Section 3.2. We then train locator models on light curves with a fixed S/N value and then evaluate the performance on light curves from the same data set. The result of this training strategy is shown as the blue points in Figure 8. Here we only show the RMSE values of t0 as the trend is very similar for tE. Next, we train one locator model on events with the lowest quality (S/N = 10) and apply it to all data sets, the result of which is shown in orange in Figure 8. In the third test, we train locator on events with the highest quality (S/N = 1000) and apply it to all data sets. This result is shown in green in the same figure. As expected, locator performs the best if it is trained and evaluated on light curves with the same (or very similar) data quality. However, the performance of locator is only slightly worse if it is trained on the noisiest data set. This is probably because locator, similar to other convolutional neural networks (Goodfellow et al. 2016), is effectively composed of a series of filters learned from the data. When faced with noisy data, it learns to denoise through training, so the relatively large noises in the training data make locator more robust. For the same reason, the accuracy of the predicted t0 is substantially worse on noisy light curves if locator only saw during the training light curves of the highest quality. For applications to real events, it seems that training one locator on relatively noisy data may be a good strategy that balances efficiency and accuracy.

Figure 8.

Figure 8. The performance of locator, evaluated on RMSE of t0, as functions of the photometric uncertainty (S/N) of the individual data point. Three different training strategies are used. In blue points, locator models are trained and evaluated on data sets with given S/N values separately. In orange points, locator is trained on the data set with the lowest S/N and then applied to all data sets. In green points, locator is trained on the data set with the highest S/N and then applied to all data sets.

Standard image High-resolution image

Finally, we discuss the issue related to wide (s > 1) binaries. As pointed out earlier, the match between the ground truth labels and the predictions of locator for wide binaries is not as good as for close binaries. This larger mismatch, however, does not mean that the inference power of locator is reduced for wide binaries. After visually inspecting a large number of wide binary light curves, we find that the mismatch is related to the use of the reference system. When generating the light curves, we followed the common convention and adopted the center of mass of the binary as the origin of the coordinate system. However, for wide binaries, the external shear due to a component at wide separation is better modeled in the reference frame defined on the center of magnification. For s > 1, the offset between the center of mass and the closer center of magnification is (e.g., Di Stefano & Mao 1996; Chung et al. 2005; Penny 2014)

Equation (12)

For very wide binaries, the event timescale given by the mass of the closest lens also takes over the timescale given by the total mass of the binary. Both can lead to large offsets between the ground truth window and the predicted window. The most extreme example that has the largest deviation in t0 and tE with q = 0.99 and s = 2.8 is shown in Figure 9. Once the reference frame defined on the center of magnification is adopted, the match between the ground truth and the prediction of locator is substantially improved (see also Zhang et al. 2021).

Figure 9.

Figure 9. An example event with s > 1, in which locator "fails" to reproduce the input window. The input window (red dashed curve) is defined on the center of mass (red asterisk in the upper panel) of this wide binary system. The prediction of locator (shown in green) better matches the light-curve window (blue dashed curve) defined in the center of magnification (blue triangle in the upper panel) of the primary star (black plus on the left in the upper panel, with the one on the right being the secondary). This latter window better describes the time evolution of this wide binary event.

Standard image High-resolution image

4. Inference of Time-independent Parameters

4.1. Method

We develop a separate method to infer the time-independent microlensing parameters, which we reparameterize as

Equation (13)

As shown in Figure 1, this estimator consists of two main parts. A neural controlled differential equation (neural CDE) is used to extract features from the microlensing light curve, which is irregularly sampled and may contain large data gaps. The second part, the mixture density network (MDN), infers the probability distribution of the microlensing parameters.

To improve the training speed and reduce the memory requirement, we first apply the log-signature transform (Morrill et al. 2021). This transform outputs a shorter, steadier, and higher dimensional sequence that summarizes the substep information. For the simulated light curves with 500 time stamps (see Section 4.2), we set the log-signature depth nD = 3 and the shortened length l = 100. The output thus has v = 5 dimensions.

Neural CDE approximates the underlying process of the time series as a differential equation (Kidger et al. 2020; Kidger 2022)

Equation (14)

Here the latent state zt is defined on the time interval of the input signal, (t0, tn ), with w = 32 dimensions. The continuous signal Xt , defined on the same time interval, is obtained by applying natural cubic splines to the input, discrete signal. 9 The initial value and signal derivative of the latent state, ξθ and fθ , are two different neural networks, where the subscript θ denotes the dependence on the learnable machine-learning parameters. By Equation (14), neural CDE gradually extracts features from the signal Xt according to a learned policy fθ , records its knowledge in the latent state zt , and outputs a terminal value ${z}_{{t}_{n}}$ that summarizes the useful information of the input signal. We refer interested readers to Kidger et al. (2020) and Kidger (2022) for further details on neural CDE.

The summary features from neural CDE are then fed into the MDN (Bishop & Nasrabadi 2006) to estimate the probability distribution of the microlensing parameters

Equation (15)

Here ${\phi }_{\bar{\omega },{\rm{\Sigma }}}$ denotes the density of a multivariate Gaussian with mean $\bar{\omega }$ and covariance matrix Σ. The normalized weight πi , mean ${\bar{\omega }}_{i}$, and covariance matrix Σi of each of the nG multivariate Gaussians are given by the neural network of MDN. Compared to the masked autoregressive flow method of Zhang et al. (2021), MDN provides a straightforward and efficient alternative with explicit density estimation, which is useful for the identification of degenerate solutions and further optimization (see below). We adopt nG = 12 Gaussians with diagonal covariance matrices. This preserves universality as long as nG is large enough (Bishop & Nasrabadi 2006).

In the joint framework, estimator, the initial value network ξθ , the evolution network fθ , and MDN are constructed as residual networks (ResNet; He et al. 2016) with three residual blocks and each block consisting of two fully connected layers of width 1024. The evolution network has an additional activation of a hyperbolic tangent to rectify the dynamic. To train the model, we minimize the negative log-likelihood (NLL) loss function

Equation (16)

To further optimize the microlensing parameters, we start from the mean positions of the multivariate Gaussians and minimize the light-curve χ2 values via the Nelder & Mead (1965) simplex algorithm. In this step, the source flux parameter fS is derived analytically for a given set of (q, s, u0, α) by maximizing the model likelihood (e.g., Poleski & Yee 2019).

4.2. Data Generation

For the purpose of learning the time-independent parameters, we generate light curves as functions of the parameter τ. This is equivalent to setting t0 = 0 and tE = 1 day. The other parameters are sampled in the same way as in Section 3.2.

Each light curve contains 500 data points randomly drawn within the interval − 2 ≤ τ ≤ 2. For a typical timescale tE = 20 days, this corresponds to an averaged cadence of ∼4 hr with a range from 24 minutes (lower 10%) to 8.8 hr (upper 10%). To mimic the missing data points due to bad weather and/or instrumental failures, we throw away data points within one randomly chosen light-curve segment out of the 25 equal-duration segments that constitute the whole light curve. This leaves a data gap of 3.2 days for the aforementioned timescale. Light curves and noises are generated in the same way as in Section 3.2. Finally, we only keep light curves with significant binary features, defined as ${\chi }_{\mathrm{pspl}}^{2}$/d.o.f. > 2.

Six data sets are generated, each having 105 light curves. We use five of them for training estimator and the last one for validation and testing. To further understand the performance of estimator on light curves with different cadences, we have also generated data sets with various numbers of data points within the interval −2 ≤ τ ≤ 2. For these additional data sets, we have assumed irregular sampling but no gap.

4.3. Performance of Estimator

We train estimator with the ADAM optimizer (Kingma & Ba 2014). The learning rate, initially set to 10−4, drops by ten percent after each epoch and reaches a minimum of 10−6. Training samples are labeled by their input parameters ω and divided into minibatches of size 128. The training took ∼1 day for 50 epochs on one NVIDIA Tesla V100 GPU, and the average loss drops to −8.4 on the validation set of size 1024.

We apply the optimized estimator to 16,384 light curves from the test set and obtain the probability distribution of the microlensing parameters given by the summation of 12 multivariate Gaussians. We then apply the Nelder & Mead (1965) simplex algorithm to get the refined microlensing parameters. We show in Figure 10 the cumulative distributions of the model χ2 values of two selected sets of parameters, one that has the smallest χ2 value and the other that is closest to the input position in the parameter space. The median χ2 values are ∼500 for both solutions, comparable to the number of data points in the simulated light curve and thus approximately the degrees of freedom. This suggests that the found solutions can indeed describe the data very well. Even though in ∼20%–30% of cases the model χ2 values are substantially large compared to the number of data points, the returned parameters are still good representatives of the input parameters, as will be discussed later. The Appendix shows an example event whose model prediction gives large χ2 and yet reasonably accurate binary parameters.

Figure 10.

Figure 10. Cumulative distributions of the χ2 values of the output microlensing models of estimator. In blue and red curves models with the minimum χ2 value and the closest distance (in parameter space) to the input model are used, respectively. The vertical dashed line indicates χ2 = 500, which is approximately the medians for both curves.

Standard image High-resolution image

We have included a mixture of nG = 12 Gaussians in the density estimate. Is it flexible enough? To answer this question, we first look at the distribution of the Gaussian weights given by estimator. As shown in Figure 11, the probability distribution is dominated by one Gaussian in ∼50% of the tested events and by the top three Gaussians with the highest weights in nearly all tested events. As for the optimized solutions, 74% and 85% of the minimum χ2 and minimum RMSE solutions are given by the top four Gaussians, respectively. Only in ≲1% of the tested events does the Gaussian with the lowest weight lead to either solution. Therefore, we conclude that 12 Gaussians can satisfactorily describe the probability distribution of microlensing model parameters.

Figure 11.

Figure 11. Cumulative distributions of the normalized weights in MDN. Here π0, π1, and π2 are the top three highest weights in our MDN of 12 multivariate Gaussians. The blue curve indicates that there is one dominating Gaussian in about half of the cases, and the green curve shows that the combination of the top three Gaussians dominates the density distribution in almost all cases.

Standard image High-resolution image

We compare the estimator predictions before and after the simplex optimization with the input parameters in Figures 12 and 13. For each tested event, we choose as the model prediction the set of parameters that is closest to the input position in the parameter space, in order to minimize the impact of degenerate solutions. The results change only marginally if we pick the Gaussian with the highest weight as the model prediction, as seen in Figure 14. We use the median absolute deviation (MAD) to evaluate the agreement between input and prediction. Compared to RMSE, MAD is more robust against outliers. As shown in Figure 12, the direct output of estimator already contains a set of parameters that is fairly close to the input truth. Among all parameters, binary mass ratio q and separation s are of special interest to the physical interpretation of the event. The direct output of estimator yields a typical deviation of 0.057 (0.0155) for lgq (lgs), corresponding to a fractional uncertainty of 13% (3.6%) on q (s). The accuracy of the prediction is further improved after the optimization, as shown in Figure 13. Specifically, the optimized prediction can achieve median deviations of ∼0.03 and ∼0.006 for lgq and lgs, corresponding to fractional errors of ∼7% and ∼1.4% in q and s, respectively. Overall, the agreement between the input and predicted values is very well, demonstrating the high accuracy of our method on light curves with realistic sampling conditions.

Figure 12.

Figure 12. Comparisons between the input parameter values and the direct output of estimator. Out of the 12 Gaussian means of MDN, we identify the one that is closest to the ground truth in the parameter space and take it as the prediction. The median absolute deviation (MAD) is used to evaluate the agreement, and the RMSE is also provided for reference.

Standard image High-resolution image
Figure 13.

Figure 13. All except the lower right panel show comparisons between the input and the optimized parameter values. From the means of the 12 multivariate Gaussians, we apply Nelder & Mead (1965) simplex method and obtain optimized model parameters. The set of parameters closest to the ground truth in the parameter space is used in these comparisons. The lower right panel shows the prediction accuracy of microlensing parameters evaluated on events with various numbers of data points and no gap. Here MAD500,gap is the value of MAD evaluated on events with 500 points and data gap, as incidated in the top left corner of the other panels.

Standard image High-resolution image
Figure 14.

Figure 14. Similar to Figure 12 except that the Gaussian with the largest weight is taken as the prediction.

Standard image High-resolution image

We have also applied the optimized estimator to light curves in the additional data sets, which have different averaged cadences but no data gap. 10 The results are shown in the lower right panel of Figure 13 for the optimized prediction. There are a few interesting trends to report. First, the performance of estimator is better, although fairly marginally, on events with 500 points and no data gap. Second, the prediction accuracy decreases with decreasing number of data points as expected, but the degradation is typically small. Specifically, with 200 data points, the prediction accuracy is only worse by ∼100.13 − 1 ≈ 35% compared to the default case with 500 data points. The only exception is seen in the case with 100 data points, for which the performance is reduced by a factor of ≳2. This is probably because the averaged cadence becomes too low to constrain binaries with mass ratios down to 10−3. 11 Nevertheless, the overall uncertainty in the case of 100 points remains fairly small and acceptable. These results may provide useful guidance to the future application of our method to real light curves with variant cadences and/or the detection of low-mass-ratio planets.

We show in Figure 15 an example event whose anomalous feature is only partially covered by data points. The probability distribution from estimator, represented by the multivariate Gaussians, is already in the neighborhood of the ground truth position. The optimization with Nelder & Mead (1965) algorithm leads to an even better match in microlensing parameters as well as the model of the light curve. Events like this one, which has irregular samplings and large data gaps, pose challenges to machine-learning methods that are based on the standard RNN approach.

Figure 15.

Figure 15. An example event in which the binary caustic-crossing feature is partially missing due to the injected data gap, as shown by the light curve in the top right corner. The red color is for the input model, whereas the blue color is for the prediction given by estimator. The corner plot shows the probability distributions of the four MDN Gaussians with the highest weights, with the colors indicating the relative weights. Contours denote the 1σ boundaries of the multivariate Gaussians with diagonal covariance matrices. As we have chosen axis limits to optimize the overall appearance, not all four contours are visible in every panel. Values and positions of the ground truth and the prediction of estimator on the corner plot are also indicated. Even without the final optimization, the Gaussian distribution with the highest weight gives a fairly good approximation of the input parameters.

Standard image High-resolution image

Our method also identifies multiple solutions in events that suffer model degeneracy. We defer to Section 5.2 for further discussions on this issue.

5. Joint Pipeline and Real Event Application

5.1. Joint Pipeline

Combining the methods described in Sections 3 and 4, the joint pipeline of MAGIC works as follows. First, we feed the light curve into locator, which outputs the predicted t0 and tE. If tE is small, we can further refine it using locator with a larger k. Then we use the predicted t0 and tE to shift and rescale the light curve. The resulting light curve is processed by estimator, which in return gives the positions and shapes of nG = 12 Gaussians in the parameter space. If needed, these Gaussians can be further optimized to obtain more accurate predictions (see Section 4.1).

The performance of the combined pipeline is largely determined by the propagation of inaccurate parameters from locator to estimator. We therefore test the joint pipeline on 16,384 light curves from the test set of locator. After applying locator, we transform the light curves using the predicted values of t0 and tE. The transformed light curves are then sent to estimator to infer the time-independent microlensing parameters. The set of parameters closest to the position of the input values in the parameter space is adopted as the final prediction.

The comparisons between the input and the predicted values of the time-independent parameters are shown in Figure 16. The typical, fractional deviation in q and s are 25% and 9%, respectively. These deviations remain small, even though they are worse than the estimator-only case by a factor of ∼2. Overall, the joint pipeline achieves fairly high prediction accuracy in the time-independent parameters even with the use of the inaccurate time-dependent parameters. We expect the accuracy to be further improved once the predictions are optimized via the Nelder & Mead (1965) algorithm, as seen in the estimator-only case.

Figure 16.

Figure 16. Performance of the joint pipeline. Simulated light curves for locator are used in this test. After the application of locator, the light curve is sent to estimator for the inference of the time-independent parameters.

Standard image High-resolution image

A substantial fraction of binary events can be well described by a Paczynski (1986) model plus relatively short binary perturbation. For such events, one may mask out the anomalous feature and fit a single-lens model to obtain precise and accurate peak time t0 and event timescale tE. Combining these values with estimator, we can achieve nearly as accurate time-independent parameters as we do in the estimator-only case.

It is worth noting that the light curves generated for locator are similar to but not the same as those used for training estimator. First, locator uses binary light curves with ${\chi }_{\mathrm{pspl}}^{2}/{\rm{d}}.{\rm{o}}.{\rm{f}}.\,\gt \,1.25$, whereas estimator uses binaries with ${\chi }_{\mathrm{pspl}}^{2}/{\rm{d}}.{\rm{o}}.{\rm{f}}.\,\gt \,2$. Both correspond to a Δχ2 > 500 between single and binary models, under the reasonable assumption that the binary signals are usually local features. 12 Furthermore, the time window is fixed and the timescale parameters vary in the data set for locator. This means the number of data points within −2 ≤ τ ≤ 2 varies too. Nevertheless, we have shown in Section 4.3 that the performance of estimator is only mildly affected by the various length of input time series.

In addition to its high accuracy, our method demonstrates high efficiency in parameter estimation. Specifically, locator and estimator can process ∼1000 and ∼400 light curves per second on a single GPU, respectively. As shown in Figure 13, the output of estimator is already accurate enough for purposes like statistical studies. If needed, the optimization step can be performed. Compared to the two machine-learning steps, this traditional optimization step is slower, but there are only up to 12 initial positions to be optimized, which altogether may require up to a few thousand light-curve evaluations. For comparison, the traditional approach usually requires optimizations at individual grid points on a fine grid in the (lgq, lgs, α) space (e.g., Albrow et al. 2000; Wang et al. 2022). Our approach therefore represents a speedup by factors of ∼102–105.

5.2. A Real Event Application

We apply MAGIC to KMT-2019-BLG-0371 (hereafter KB190371 for short), which is a short-timescale (tE = 6.5 days) binary event with q ≈ 0.08 or 0.12 (Kim et al. 2021). KB190371 is chosen for this demonstration based on several considerations. First, its caustic-crossing feature is prominent and U-shaped, which is typical for binary events. Second, this event suffers from close/wide degeneracy (Griest & Safizadeh 1998; Dominik 1999) 13 , thus enabling additional tests of MAGIC. Furthermore, except for the relatively short timescale, the other microlensing parameters are fairly typical in comparison to those used in our simulation.

The available data sets from Kim et al. (2021) are not aligned to the same magnitude system. In order to align the data sets and then apply MAGIC, we mask out the caustic-crossing feature, fit a Paczynski (1986) model to the data, and use the best-fit flux parameters to align all other data sets to the OGLE one. We note that, although here we have made use of the special property of KB190371 to align the data sets, such a data alignment task is the same as deriving the source color and can be performed without reference to specific microlensing models (Yoo et al. 2004; Gould et al. 2010a), as has been practiced in the analysis of many real microlensing events (e.g., Gould et al. 2010b; Zhu et al. 2017).

To apply our joint pipeline, we first preprocess the data and make the light curve similar to those used in the training procedure. To be more specific, we only include data points falling in the time interval of about 100 days centered at the anomaly, and filter out data points with large (>0.1 mag) error bars. KB190371 was observed with a high cadence, so we bin every five consecutive points into one. Finally, the light curve is shifted to the same base magnitude as those used in locator.

Then we feed the preprocessed light curve into locator. Because the timescale is relatively small, we set k = 2. The output mask is rectified into binary values {0, 1} with a threshold of 0.5. This step gives the prediction ${t}_{0}\,(\mathrm{HJD}^{\prime} )\,=8592.391$ and tE = 6.68 days, both of which are close to the values are given in Kim et al. (2021). We transform the light curve with the predicted t0 and tE and only keep the segment within −2 ≤ τ ≤ 2.

The rest of the application follows closely the procedure given in Section 5.1. With the positions of the 12 multivariate Gaussians given by estimator, we perform Nelder & Mead (1965) optimization to further refine the microlensing parameters. Because the light curve shows a prominent finite-source effect, we also include in the optimization the scaled source size ρ.

The results are shown in Figure 17. The top two Gaussians with the highest weights match closely to the close (s < 1) and wide (s > 1) solutions, respectively. The corresponding model light curves from the optimized solutions both match the data reasonably well, with almost indistinguishable χ2 values. According to these solutions, the event can be explained by a binary with (lgq, lgs) = (−1.16, −0.057) or (−0.95, 0.180). Given the imperfect alignment as well as the slightly different preprocessing procedures, we consider these values in good match with those given by the discovery paper.

Figure 17.

Figure 17. The application of MAGIC to KB190371 and the output. The light curve and the MAGIC predictions are shown in the top right panel. Two solutions, labeled "close" and "wide," have comparable χ2 values, and the corresponding microlensing geometries (i.e., caustic curve and source trajectory) are shown below the light-curve plot. The corner plot shows the positions of the optimized solutions as well as the top two Gaussians that correspond to the "close" and "wide" solutions.

Standard image High-resolution image

Kim et al. (2021) found the source scaled parameter ρ to be (6.46 ± 0.17) × 10−3 and (6.99 ± 0.21) × 10−3 for the close and wide solutions, respectively. These are quite different from the value (10−3) that we have used to train MAGIC. Nevertheless, the optimized parameters yield ρ = 6.6 × 10−3 and 7.3 × 10−3 for the close and wide solutions, respectively, in good agreement with the values from the discovery paper. In other words, the presence of the finite-source effect in KB190371 does not seem to confuse MAGIC, even though MAGIC has not learned to infer the source scaled parameter ρ. This is probably because the finite-source effect in KB190371 only affects a limited number of data points around the caustic entrance and exit. While the majority of binary microlensing events bear the same feature, some may have their binary signals substantially modified by the finite-source effect (e.g., Hwang et al. 2018). Therefore, it may be necessary to include in MAGIC the ability to infer ρ. See Section 6 for further discussion.

To further demonstrate the capability of MAGIC to handle large data gaps, we exclude observations from the South Africa site of KMTNet. This creates a substantial gap during the exit of the caustic crossing, as shown in the top right panel of Figure 18. The same procedure is applied as before, and the results are shown in Figure 18. The two degenerate solutions are again identified at high accuracy, indicating the robustness of MAGIC in handling large data gaps as well as model degeneracy.

Figure 18.

Figure 18. Similar to Figure 17 except that observations from the South Africa site of KMTNet are excluded in this analysis. As a result, the exit portion of the caustic-crossing feature is not covered with observations. Even so, MAGIC is able to identify the two degenerate solutions at high accuracy.

Standard image High-resolution image

6. Discussion

We present MAGIC, a machine-learning method that is readily applicable to the analysis of binary microlensing light curves from ongoing and future photometric surveys. We divide the microlensing model parameters into two groups and develop separate deep learning neural networks. For the inference of time-dependent parameters, namely the peak time t0 and the event timescale tE, we construct locator, which is a U-Net architecture (Ronneberger et al. 2015). For the inference of time-independent parameters, namely (u0, q, s, α, fS), we construct estimator, which uses neural CDE (Kidger et al. 2020; Kidger 2022) to extract features and mixture density network (MDN; Bishop & Nasrabadi 2006) to infer the probability distribution. The output parameters can be further refined through the standard optimization method. Depending on the properties of the event under investigation, the two components can be used separately or jointly to achieve the best performance.

A little MAGIC can take you a long way. 14 We train and validate both locator and estimator as well as the joint pipeline on simulated light curves. To mimic the sampling condition in the realistic world, we introduce irregular sampling and a large data gap in the light-curve simulation. As demonstrated in previous sections, MAGIC can recover the input parameters at high accuracy for the majority of tested events. In particular, estimator alone can achieve typical fractional errors of ∼7% and ∼1.4% for the binary mass ratio q and the separation s, respectively. As demonstrated in the analysis of a real microlensing event (KB190371; Kim et al. 2021), MAGIC is also capable of identifying degenerate models that explain the same data. The proposed method is fast to perform, with an estimated speedup of ∼102–105 depending on whether or not the optimization is needed.

While MAGIC is readily applicable to the analysis of real-world data sets, a few improvements can be made to make MAGIC more powerful and more robust. First, this work focuses on binaries with relatively large mass ratios (q > 10−3), and extension into lower values may be necessary in order to apply MAGIC to the analysis of low-mass planetary events. Second, the source scaled size ρ is not inferred in the current version of MAGIC for reasons given in Section 3.2, although it can still be constrained in the optimization stage (see Section 5). The inclusion of ρ into MAGIC involves effectively both classification and regression tasks, as one will need to determine whether or not the finite-source effect is detected and if yes, infer the value of ρ. Furthermore, the inclusion of additional effects—especially the microlensing parallax effect (Gould 1992) and the binary source effect (Griest & Hu 1992)—may be necessary in order for MAGIC to provide robust estimates on binaries with relatively weak signals. Last but not least, MAGIC can also be applied to the analysis of microlensing events with more than two lenses, for which the sampling-based method becomes much more time consuming (e.g., Gaudi et al. 2008; Kuang et al. 2021).

MAGIC uses neural CDE to handle irregular sampling and data gaps. Compared to other existing methods that have been used to handle the same issues in astronomical time series (e.g., Charnock & Moss 2017; Naul et al. 2018), neural CDE takes directly the irregularly sampled time series and models the underlying process with a differential equation. This approach has been demonstrated to achieve good performances in other disciplines that also involve time series data (e.g., Kidger et al. 2020). As irregular sampling and data gaps are common in astronomical observations, we expect that neural CDE will be useful to many other fields in time domain astronomy.

We thank Zerui Liu and Minghao Liu for their discussions. We also thank the anonymous referee for their comments and suggestions on a previous version of the manuscript. This work is supported by the National Science Foundation of China (grant Nos. 12173021 and 12133005) and the science research grant from the China Manned Space Project with No. CMS-CSST-2021-B12. We also acknowledge the Tsinghua Astrophysics High-Performance Computing platform for providing computational and data storage resources.

Software: NumPy (Harris et al. 2020), SciPy (Virtanen et al. 2020), Pandas (Wes McKinney 2010), Matplotlib (Hunter 2007), PyTorch (Paszke et al. 2019), MulensModel (Poleski & Yee 2019), corner (Foreman-Mackey 2016), Jupyter (Kluyver et al. 2016), torchcde (Kidger 2022).

Appendix: Events with Large χ2

As described in Section 4.3, estimator achieves reasonably high accuracy in microlensing parameters but fails to obtain good light curve fits (i.e., small χ2 values) in ∼20%–30% of the test cases. This phenomenon arises because a small deviation in the parameter estimation may still lead to a large deviation in the light-curve behavior, thus resulting in a large χ2 value. Here we examine a specific example, which is the event with the largest χ2 value in the test set. The predicted parameters only differ from the ground truth values by (0.02, 0.04, 0.02, 0.04) in (u0, lgq, lgs, α/180°). However, the corresponding light curve deviates from the ground truth drastically, as shown by the red curve in the top panel of Figure 19. To illustrate the impact on the light-curve behavior caused by minor changes in the parameter values, we change one by one the values of α, u0, and lgs from the model prediction to the ground truth values, and plot the corresponding light curves in the same figure. We observe that the light-curve fit remains bad until all three parameters are fine tuned. This confirms that even when the parameter estimation is accurate enough, the light-curve fit may still look bad. Nevertheless, an accurate estimation of the parameters is enough for most applications.

Figure 19.

Figure 19. An example event for which the raw prediction of estimator yields the largest χ2 value among all test events. The top panel shows the data and model light curves, and the middle and bottom panels are the corresponding microlensing geometry plots. The raw prediction of estimator is shown in red, and the ground truth is shown in black. We then change one by one the values of α, u0, and s from the raw prediction to the ground truth values and show the corresponding light curves and lensing geometry plots with different colors.

Standard image High-resolution image

Footnotes

  • 3  

    This is the same in MulensModel (Poleski & Yee 2019).

  • 4  

    This value of nt is chosen to be larger than the number of time stamps (2000; see Section 3.2). The resulting time resolution, 270/4000 ≈ 0.07 days, is much smaller than the RMSE (2.2 days, see Section 3.3) of t0.

  • 5  

    This crude approach of directly interpolating the irregular time series is justified given that the temporal parameters depend mostly on the macroscopic behavior of the light curve, rather than the fine structure within a small time interval. This assumption is no longer valid for the parameters discussed in Section 4.

  • 6  

    Here "pspl" represents for point-source point lens.

  • 7  

    This includes light curves used in Figure 4 with tE < 40 and additional ones in the test set that add up to 16,384.

  • 8  

    In real applications, one can and should preprocess the original light curve via, for example, outlier removal and data point rebinning, to make its data quality similar to that of simulated light curves.

  • 9  

    Note that the use of cubic splines does not mean that neural CDE is just another interpolation scheme. This is solely for creating a continuous embedding of the discrete input. In contrast, standard interpolation schemes only make use of the values at fixed time stamps. A detailed discussion on this issue can be found in Kidger (2022).

  • 10  

    That is, we directly apply estimator without retraining.

  • 11  

    The averaged cadence in terms of τ is 4/100, and thus the smallest binary mass ratio that it can detect is ≳(4/100)2 = 1.6 × 10−3.

  • 12  

    Although we have focused on such binaries with strong signals, we expect our method to work on weak binaries as well.

  • 13  

    We use this term to keep consistent with the original discovery paper, even though recent studies have suggested that such "close/wide degeneracy" should be called offset degeneracy (Zhang et al. 2022).

  • 14  

    From Roald Dahl.

Please wait… references are loading.
10.3847/1538-3881/ac9230